Premier League prediction using neural network

An example of a multivariate data type classification problem using Neuroph Studio

By Sandro Radovanović and Milan Radojičić, Faculty of Organization Sciences, University of Belgrade

In this experiment it will be shown how neural networks and Neuroph Studio are used when it comes to problems of classification. Several architectures will be tried out, and it will be determined which ones represent a good solution to the problem, and which ones do not.

Classification is a task that is often encountered in every day life. A classification process involves assigning objects into predefined groups or classes based on a number of observed attributes related to those objects. Although there are some more traditional tools for classification, such as certain statistical procedures, neural networks have shown to be an effective solution for this type of problems. There are a number of advantages to using neural networks - they are data driven, they are self-adaptive, they can approximate any function - linear as well as non-linear (which is quite important in this case because groups often cannot be divided by linear functions). Neural networks classify objects rather simply - they take data as input, derive rules based on those data, and make decisions.

The objective of this problem is to create and train neural network to predict whether home team wins, visitor team wins or it will be draw in Barclays Premier League, given some attributes as input. First we need data set. For this problem we choose results of Premier League season 2011/12. Because of great number of matches we randomly sampled 106 results. Each result has 8 input and 3 output attributes. Input attributes are:

Values of attributes depends of number of players playing on certain position and players rating. Therefor, for goalkeeper value is maximum 100. If there are four players in defence maximum value is 400 etc.

If we want to use this data set for classification, we need to normalize it. Type of neural network that will be used is multilayer perceptron with backpropagation.

All input attributes are have integer values that can be very distant from each other. For example, goalkeeper rating can have maximum value of 100, and attack rating can have maximum of 300. In that case attack rating will influence more on problem than goalkeeper. To prevent that we will normalize data set using Max-Min normalization formula.

Where B is the standardized value, and D and C determines the range in which we want our value to be. In this case, D = 1 and C = 0.

Normalized values are saved in PremierLeagueResults.txt file because they will be used for training and testing neural network

After normalizing all data we can start with Neuroph Studio. First we will create new Neuroph project.

The project will be named ’PredictPremierLeague’. After we have clicked 'Finish', a new project is created and it will appear in the 'Projects' window, in the top left corner of Neuroph Studio.

In order to neural network learn the problem we need traaining data set. The training data set consists of input signals assigned with corresponding target (desired output). The neural network is then trained using one of the supervised learning algorithms, which uses the data to adjust the network's weights and thresholds so as to minimize the error in its predictions on the training set. If the network is properly trained, it has then learned to model the (unknown) function that relates the input variables to the output variables, and can subsequently be used to make predictions where the output is not known.

Select training set file type and click 'Next'. After that, enter training set name. Select the type of supervised.

In general, if you use a neural network, you will not know the exact nature of the relationship between inputs and outputs – if you knew the relationship, you would model it directly. The other key feature of neural networks is that they learn input/output relationship through training. There are two types of training used in neural networks, with different types of networks using different types of training. These are supervised and unsupervised training, of which supervised is the most common. In supervised learning, the network user assembles a set of training data. The training data contains examples of inputs together with the corresponding outputs, and the network learns to infer the relationship between the two. In other words, supervised learning is used for classification. For an unsupervised learning rule, the training set consists of input training patterns only. Unsupervised learning, on other hand, is used for clustering.

Our, normalized, data set, that we create above, consists input and output values. Therefore we choose supervised learning. In field Number of inputs enter 8 and in field number of outputs enter 3 and click 'Next'.

Training set can be created in two ways. You can either create training set by entering elements as input and desired output values of neurons in input and output label, or you can create training set by choosing an option load file. The first method of data entry is time consuming, and there is also a risk to make a mistake when entering data. Since we already have training set we will choose second way.

Click on Choose File and find file named PremierLeagueresults.txt. Then select tab as values separator. In our case values have been separated with tab. In some other case values of data set can be separated on the other way. When finished, click on 'Load'.

Now we need to create neural network. In this experiment we will analyze several architecture. Each neural network which we create will be type of Multi Layer Perceptron and each will differ from one another according to parameters of Multi Layer Perceptron.

This is perhaps the most popular network architecture in use today: the units each perform a biased weighted sum of their inputs and pass this activation level through a transfer function to produce their output, and the units are arranged in a layered feedforward topology. The network thus has a simple interpretation as a form of input-output model, with the weights and thresholds (biases) the free parameters of the model. Such networks can model functions of almost arbitrary complexity, with the number of layers, and the number of units in each layer, determining the function complexity.

To create Multi Layer Perceptron network click File -> New File and select desired project from Project drop-down menu, Neural Network file type as you see in picture below.

We will call this network PremierLeague1 and we will select Multi Layer Perceptron.

In new Multi Layer Perceptron dialog enter number of neurons. The number of input and output units is defined by the problem, so you need to enter 8 as number of input neurons and 3 as number of output neurons.

The number of hidden units to use is far from clear. If too many hidden neurons are used, the network will be unable to model complex data, resulting in a poor fit. If too few hidden neurons are used, then training will become excessively long and the network may overfit.

How about the number of hidden layers? For most problems, one hidden layer is normally sufficient. Therefore, we will choose one hidden layer. The goal is try to quickly find the smallest network that converges and then refine the answer by working back from there. Because of that, we will start with 2 hidden neurons and if the network fails to converge after reasonable period, we will restart training up to ten times, thus ensuring that it has not fallen into local minimum. If the network still fails to converge we will add another hidden neuron and repeat procedure.

Further, we check option 'Use Bias Neuron'. Bias neurons are added to neural networks to help them learn patterns. A bias neuron is nothing more than a neuron that has a constant output of 1. Because the bias neurons have a constant output of one they are not connected to the previous layer. The value of 1, which is called the bias activation, can be set to values other than 1. However, 1 is the most common bias activation.

If your values in the data set are in the interval between -1 and 1, choose Tanh transfer function. In our data set, values are in the interval between 0 and 1, so we used Sigmoid transfer function.

As learning rule choose Backpropagation With Momentum. Backpropagation With Momentum algorithm shows a much higher rate of convergence than the Backpropagation algorithm. Choose Dynamic Backpropagation algorithm if you have to training dynamic neural network, which contain both feedforward and feedback connections between the neural layers.

If you want to see neural network as a graph, just select 'Graph View'. Right nodes in first and second level are bias neurons that we explained above.

If we choose 'Block View' and look at the top left corner of View screen we will see that training set is empty. To traing Neural Network we need to put training data in that corner. To do that we will just click on training set that we created and click 'Train'. A new window will open, where we need to set the learning parameters, learning rate and momentum.

Next thing we should do is determine the values of learning parameters, learning rate and momentum.

Learning rate is one of the parameters which governs how fast a neural network learns and how effective the training is. Let us assume that the weight of some synapse in the partially trained network is 0.2. When the network is introduced with a new training sample, the training algorithm demands the synapse to change its weight to 0.7 (say) so that it can learn the new sample appropriately. If we update the weight straightaway, the neural network will definitely learn the new sample, but it tends to forget all the samples it had learnt previously. This is because the current weight (0.2) is a result of all the learning that it has undergone so far. So we do not directly change the weight to 0.7. Instead, we increase it by a fraction (say 25%) of the required change. So, the weight of the synapse gets changed to 0.3 and we move on to the next training sample. Proceeding this way, all the training samples are trained in some random order. Learning rate is a value ranging from zero to unity. Choosing a value very close to zero, requires a large number of training cycles. This makes the training process extremely slow. On the other hand, if the learning rate is very large, the weights diverge and the objective error function heavily oscillates and the network reaches a state where no useful training takes place.

The momentum parameter is used to prevent the system from converging to a local minimum or saddle point. A high momentum parameter can also help to increase the speed of convergence of the system. However, setting the momentum parameter too high can create a risk of overshooting the minimum, which can cause the system to become unstable. A momentum coefficient that is too low cannot reliably avoid local minima, and can also slow down the training of the system.

There are two stopping criteria. One is maximum error and second one is maximum number of learning iterations, which are intuitively clear.

We can see in pictures below that training was unsuccesfull. After 75325 iterations Neural Network failed to learn problem with error less than 0,01. We can test this network but error will be greater than expected.

After the network is trained, we click 'Test', in order to see the total error, and all the individual errors. The result show us that total mean square error is aproximatly 0.17, which is to much. Individual error are also pretty big. Lets look at last result. Values of output are 0.1732, 0.1121 and 0.7603 but result should be 1,0,0. With this information we can conclude that this Neural Network is not good enough.

So let we try something else. We will update the weight of learning rate and increase it by 25%. In network window click Randomize button and then click Train button. That means that we will set value of 0.2 in learning rate label replace with a new value 0.3 and click 'Train' button.

After training Network with these parameters we got better results but still not good enough.

Increasing the value of learning rate we conclude that the objective error function oscillates more and the network reaches a state where no useful training takes place.

In the table below for the next three sessions we will present the results of other trainings for the first architecture. For other trainings is not given graphic.

Based on data from Table 1 can be seen that regardless of the parameters of training error do not falls below a specified level, even if we train the network through a different number of iterations. This may be due to the small number of hidden neurons. In the following solution we will increase the number of hidden neurons.

Next Neural Network will have same number of input and output neurons but different number of neurons in hidden layer. We will use 4 hidden layer neurons. Network in named PremierLeague2.

First training course, of second architecture, we will start with extremely low values of learning rate and momentum. First click on button 'Train'. In 'Set Learning parameters' dialog, field 'set Stopping criteria enter 0.01 as max error. In order to graphically display, the training of this network, was clearer. In field 'set Learning parameters', enter 0.01 for 'Learning rate' and 0.05 for 'Momentum'. After entering this values click on button 'Train'.

During the testing we unsuccessfully trained the neural network named PremierLeague2. The summary of the results are shown in the Table 2.

From the graphics above can be seen from iteration to iteration there are no large shifts in the prediction. More accurate in predicting, fluctuations are very small and the values are around 0.1. Reason for such a small fluctuation is that the learning rate is very close to zero. Also because of such a small coefficient, of the learning rate, neural network has no the ability to learn quickly. On the other hand small value of momentum slows down the training of the system.

Like in last attempt we will try extremely high values of learning rate and momentum. Compared to previous training, we will just replace the values of learning rate and momentum. For learning rate we will enter 0.9 and for momentum also 0.9. Other options will be the same as in the previous training.

During the testing we unsuccessfully trained the neural network named PremierLeague2. The summary of the results are shown in the Table 2.

In picture below we see distinction between small values and large values of learning parameters. We set the momentum parameter too high and we have created a risk of overshooting the minimum, which caused the system to become unstable. On the other hand, the learning rate is very large, the weights diverge and the objective error function heavily oscillates and the network reaches a state where no useful training takes place.

In previous two attempts we used extreme values of learning parameters, so this time we will use recommended values. That is, 0.2 for learning rate and 0.7 for momentum.

Following useful conclusion can be drawn from this training. We can see that the architecture of four hidden neurons is not appropriate for this training set, because for continuing the training of the neural network we do not get the desired approximation of max error. Error is still much higher than desired level.

The oscillations are less than second training (which was expected because the parameters of training is less than in the previous case), but on the other side neural network has no the ability to learn quickly and the training of the system is slow (just like in first training).

In the table below for the previous three sessions we will present the results of all trainings for the second architecture.

After several tries with different architecture and parameters we got results that are given in table 3. There is interesting pattern in data. If we look number of hidden neurons and total net eror we can see that higher number of neurons leads us to lesser total net error.

This neural network will contain 16 neurons in hidden layer, as we see in picture below, and same options as previous networks.

First we will try with recommended values for learning rate and momentum. That is, 0.2 for learning rate and 0.7 for momenum.

During the testing we successfully trained the neural network named PremierLeague6. The summary of the results are shown at the final table at the end of this article.

The total net error slowly descends but with high oscilation and finally stops when reaches a level lower than a given (0.01) in 4876 iteration.

Total Mean Square Error measures the average of the squares of the "errors". The error is the amount by which the value implied by the estimator differs from the quantity to be estimated. An mean square error of zero, meaning that the estimator predicts observations of the parameter with perfect accuracy, is the ideal, but is practically never possible. The unbiased model with the smallest mean square error is generally interpreted as best explaining the variability in the observations. The test showed that total mean square is 0.0071640672790769504. The goal of experimental design is to construct experiments in such a way that when the observations are analyzed, the mean square error is close to zero relative to the magnitude of at least one of the estimated treatment effects.

Now we need to examine all the individual errors for every single instance and check if there are any extreme values. When you have a large data set, individual testing requires a lot of time. Instead of testing 106 observations we will random choose 5 observations which will be subjected to individual testing. Three following table will show the value of input, output and errors in 5 randomly selected observations. These values are taken from the window Test Results.

In introduction we mentioned that result can belong to one of three groups. So if home team won output would be 1, 0, 0, if away team wins it would be 0, 0, 1 and they played draw output would be 0, 1, 0. After completion of testing would be ideal if the value of output after the test were the same as the output values before testing. As with other statistical methods, and classification using neural networks include errors that arise during the approximation. Individual error between the original and the assessed values are shown in Table 4.3.

For observation 3 and 63 we can say that there is reasonable mistake in classification. That is they are bigger than 1%. Therefor, we will continue training neural network by increasing learning rate to 0.3 (by 25%).

At the beginning we said that the goal is try to quickly find the smallest network that converges and then refine the answer by working back from there. Since we find the smallest neural network do the following:

After 591 iterations total net error is 0.005 and total mean square error is 0.02. But what is the most interesting are the values of errors of observations. They are given in table 4.3.

Because this network learned data perfectly individual error will be equals to zero as we see in table 4.5.

Recommendation: If you do not get the desired results, continue to gradually increase the training parameters. The neural network will definitely learn the new sample, and it would not forget all the samples it had learnt previously.

Advanced Training Techniques

When the training is complete, you will want to check the network performance. A learning neural network is expected to extract rules from a finite set of examples. It is often the case that the neural network memorizes the training data well, but fails to generate correct output for some of the new test data. Therefore, it is desirable to come up with some form of regularization.

One form of regularization is to split the training set into a new training set and a validation set. After each step through the new training set, the neural network is evaluated on the validation set. The network with the best performance on the validation set is then used for actual testing. Your new training set should consist of 80% - 90% of the original training set, and the remaining 10% - 20% would be classified in the validation set. Then you have to compute the validation error rate periodically during training and stop training when the validation error rate starts to go up. However, validation error is not a good estimate of the generalization error, if your initial set consists of a relatively small number of instances. Our initial set, we named it PremierLeague, consists of only 106 instances. In this case 10% or 20%, of the original training set, consisted of the 10 or 20 instances. This is the insufficient number of instances to perform validation. In this case instead validation we will use a generalization as a form of regularization.

One way to get appropriate estimate of the generalization error is to run the neural network on the test set of data that is not used at all during the training process. The generalization error is usually defined as the expected value of the square of the difference between the learned function and the exact target.

In the following examples we will check the generalization error, such as from the example to the example we will increase the number of instances in the training set, which we use for training, and we will decrease the number of instances in the sets that we used for testing.

We will choose random 70% of instances of training set for training and remaining 30% for testing. First group will be called PremierLeague70, and second PremierLeague30.

Unlike previous training, now there is no need to create new neural network. Advanced Training Techniques consist in the fact that we examine the performance of existing architectures, using a new training and test set of data. Satisfactory results we found using architecture PremierLeague6. By the end of this article we will use not only this architecture, but also the parameters of the training that we used in this architecture previously which brought us desired results. But before you open an existing architecture, create new training sets. First training set name it PremierLeague70 and second one name it PremierLeague30.

Now open neural network PremierLeague6, select training set PremierLeague70 and in new network window press button 'Train'. The parameters that we now need to set will be the same as the ones in previous training attempt: the maximum error will be 0.01, the Learning rate 0.2, and the Momentum 0.7. We will not limit the maximum number of iterations, and we will check 'Display error graph', as we want the see how the error changes throughout the iteration sequence. Then press 'Train' button again and see what will happen.

Although, problem contained fewer instances it took 11372 iterations to train this network. Because it managed to converge to total net error of 0.01 we can declare this training succesfull.

After successful training the neural network, we can test the same to discover wheter the results will be as good as the previous testing.

Unlike previous practice, where we have to train and test neural networks using the same training set, now we will use the second training set, named PremierLeague30, to test network in which there are data that a neural network has not been seen.

So go to network window, select training set PremierLeague30 and press button 'Test'.

Total Mean Square Error is 0.07 which is 0.06 or 6% higher than desired error. Percentage is not that big especially if we consider that this is sports prediction, but we should look at individual error to see are there any one result than is completly mistaken. If we look at rows 8, 15, 17, 20, 22 and 25 we can see that they make mistakes. Cases 22 and 25 are completle mistaken, but cases 8, 15 and 17 are interesting. They are classified correctly but there is high valueof membership to other group.

From this the conclusion is drawn that the neural network memorizes the training data well, but fails to generate correct output for some of the new test data. The problem may lie in the fact that we used 25 instances for the test vs. 81 instances that are used to train neural network. So how many data should be used for testing? Some authors believe that the 10% could be a practical choice. We will create four new training sets. More precisely we will make two training set to train and two training set to test the same architecture. Two training sets, which we use to train the network, will consist of 80% and 90% of the initial instances of our original training set, and the remaining two training sets, which we use to test the network, will consist of 20% and 10% of the initial instances of our original training set. Final results of the advanced training you can see in Table 4. And further we will restricted to a maximum error of 0.01, 0.3 for learning rate and 0.7 for momentum.

Training attempt	Neural Network	Training set	Testing set	Iterations	Total Net Error (during training)	Total Mean Square Error (during testing)
17.	PremierLeague6 (16 neurons in hidden layer)	70%	30%	11372	0.0099	0.0744
18.	PremierLeague6	80%	20%	3743	0.0099	0.1118
19.	PremierLeague6	90%	10%	2918	0.0086	0.1701
21.	PremierLeague7 (20 neurons in hidden layer)	70%	30%	2313	0.0091	0.1650
22.	PremierLeague7	80%	20%	1881	0.0086	0.1384
23.	PremierLeague7	90%	10%	2462	0.0084	0.1808
26.	PremierLeague8 (30 neurons in hidden layer)	70%	30%	2467	0.0098	0.1605
27.	PremierLeague8	80%	20%	2698	0.0098	0.1670
28.	PremierLeague8	90%	10%	2097	0.0094	0.1952

After 17th training attempt we concluded that there are some cases that makes big impact on Total Mean Squared Error. In 18th attempt we found four big errors and one correctly classified with big error (out of 17), and in 19th attempt we have one big error and one correctly classified with big error (out of eight). With big error we mean that network classified completly wrong (for example it is 1, 0, 0 but it should be 0, 1, 0) and that error makes huge impact on Total Mean Square Error.

Because all of these network failed to make error less than 0.01 we can say that this network failed to generalize this problem.

After finding initial neural network (PremierLeague6), every neural network with more neurons in hidden layer does not improve overall prediction.

During this experiment, we created six different architectures, one basic training set and six training sets derived from the basic training set. We normalize the original data set using a Max-min normalization or linear scaling method. Through six basic steps we explained in detail the creation, training and testing neural networks. If the network architecture using a small number of hidden neurons training will become excessively and the network may over fit no matter what are the values of training parameters.

Training attempt 1

Training attempt 2

Training attempt 5

Training attempt 6

Training attempt 7

Training attempt 16

Advanced Training Techniques

Training attempt 17